The dataset has 10 variables: carat, cut, color, clarity, depth, table, price, x (length), y (width), and z (depth). It contains 53,940 records of different diamonds. The main variables we will cover in this analysis are carat, cut, color, clarity, and price. We will try to find the relationship between price and the other variables.
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Carat ranges from 0.2 to 5.01. The median carat size is 0.7 and the mean is about 0.8. The histogram shows the largest concentrations of diamonds at 0.2 - 0.3 and 0.9 - 1.0 carats. About 75% of the diamonds weigh no more than roughly 1 carat (the third quartile is 1.04).
summary(diamonds$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
range(diamonds$carat)
## [1] 0.20 5.01
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.1,fill ="#39568CFF")
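The 75% figure can be checked directly rather than read off the histogram. A minimal sketch, assuming the `diamonds` data that ships with ggplot2; the names `q3` and `share_le_1` are illustrative and not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset

# Third quartile of carat: 75% of diamonds are at or below this value
q3 <- quantile(diamonds$carat, probs = 0.75)

# Exact share of diamonds weighing no more than 1 carat
share_le_1 <- mean(diamonds$carat <= 1)

q3
share_le_1
```

Because the third quartile sits slightly above 1 carat, the exact share at or below 1 carat comes out somewhat lower than 75%.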
There are only a few diamonds over 3 carats, so we ignore them and zoom in on the range from 0 to 3 carats. The peaks tend to sit near fractional values of .0, .25, .5, and .75.
diamonds %>%
  filter(carat < 3) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01, fill = "#39568CFF")
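The peaks can also be listed as a table instead of being read off the plot. This is a rough sketch, assuming dplyr is available alongside ggplot2; `top_carats` is a hypothetical name introduced here for illustration.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Ten most frequent carat values, most common first
top_carats <- diamonds %>%
  count(carat, sort = TRUE) %>%
  head(10)

top_carats
```

Comparing the carat values in this table with the histogram shows which exact weights produce the spikes.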
### Cut
Most diamonds have an Ideal cut, followed by Premium and Very Good. The pie chart shows the percentages more clearly: about 40% are Ideal, 25.6% are Premium, and just under 3% are Fair.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
p <- c( '#440154FF','#39568CFF','#20A387FF','#95D840FF','#FDE725FF')
plot_ly() %>%
add_pie(data = count(diamonds, cut), labels = ~cut, values = ~n,
name = "Cut", domain = list(row = 0, column =0), marker = list(colors = p))
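The percentages shown in the pie chart can be reproduced numerically. A minimal sketch, assuming dplyr is loaded; `cut_share` and `pct` are illustrative names, not taken from the original code.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Number of diamonds per cut and each cut's share of the total, in percent
cut_share <- diamonds %>%
  count(cut) %>%
  mutate(pct = 100 * n / sum(n))

cut_share
```

The `pct` column is left unrounded so that shares close to a threshold, such as the Fair cut near 3%, are not distorted by rounding.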
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x= cut, y = carat))
### Color
Color is graded from J (worst) to D (best). G is the most common color grade.
ggplot(data = diamonds, mapping = aes(x= color))+
geom_bar(aes(fill=color))
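The ranking of color grades by frequency can be confirmed with a sorted count. This is only a sketch, assuming dplyr is available; `color_counts` is a hypothetical name used for illustration.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Color grades ordered from most to least frequent
color_counts <- diamonds %>%
  count(color, sort = TRUE)

color_counts
```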
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
The clarity scale, from worst to best, is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best).
ggplot(data = diamonds, mapping = aes(x= clarity))+
geom_bar(aes(fill=clarity))
### Price
Price ranges from $326 to $18,823. Most of the diamonds in this dataset cost under $5,000.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 1000,fill ="#39568CFF")
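How large "most" is can be quantified in two lines. A minimal sketch, assuming the ggplot2 `diamonds` data; `price_range` and `share_under_5000` are illustrative names introduced here.

```r
library(ggplot2)  # diamonds dataset

# Minimum and maximum price in the dataset
price_range <- range(diamonds$price)

# Share of diamonds priced below $5,000
share_under_5000 <- mean(diamonds$price < 5000)

price_range
share_under_5000
```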
# after_stat(density) replaces the deprecated ..density.. notation (needs ggplot2 >= 3.3.0)
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
Generally speaking, price increases with carat.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
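One way to put a number on this trend is a correlation coefficient, computed on both the raw and the log scale. This is a rough sketch rather than part of the original analysis; `cor_linear` and `cor_log` are hypothetical names, and Pearson correlation is an assumption about which measure is appropriate.

```r
library(ggplot2)  # diamonds dataset

# Pearson correlation between carat and price on the original scale
cor_linear <- cor(diamonds$carat, diamonds$price)

# Same correlation after log-transforming both variables
cor_log <- cor(log(diamonds$carat), log(diamonds$price))

cor_linear
cor_log
```

A higher value on the log scale would suggest that the relationship is closer to a power law than to a straight line, which is one reason to plot `log(carat)` against `log(price)`.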
ggplot(data = diamonds) +
geom_point(
mapping = aes(x = carat, y = price),
alpha=1/100
)
ggplot(data = diamonds) +
geom_hex(mapping = aes(x = carat, y = price))
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
diamonds %>%
ggplot(aes(log(carat),log(price), col= cut))+
geom_point()
diamonds %>%
ggplot(aes(carat,price, col= color))+
geom_point()
diamonds %>%
ggplot(aes(log(carat),log(price), col= color))+
geom_point()
From the scatter plot, we can see that at certain carat, I1 is always
diamonds %>%
ggplot(aes(carat, price, col= clarity))+
geom_point()
diamonds %>%
ggplot(aes(log(carat), log(price), col= clarity))+
geom_point()